In this section I import the data and prepare it for analysis.
Code
```{r setup}#| warning: false#| message: falselibrary(tidyverse)library(plotly)library(dplyr)library(tidyr)library(knitr)library(table1) #Create HTML Tables of Descriptive Statistics https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html#library(OMTM1) #https://github.com/schildjs/OMTM1/library(Hmisc)library(rms) # Regression Modeling Strategies by Frank https://cran.r-project.org/web/packages/rms/index.htmllibrary(modelsummary) #Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready https://cran.r-project.org/web/packages/modelsummary/index.htmllibrary(scales) # The scales packages provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes. https://scales.r-lib.orglibrary(viridis) #colorslibrary(cowplot) #allows me to use plotgridlibrary(gridExtra) #adding tables to plotslibrary(visdat) #shows missing datalibrary(GGally) #makes pairs plotslibrary(sandwich) #for robust standard errorslibrary(gtsummary) #makes the tables look nicersetwd("/Users/lisalevoir/BIOS7351_Collab/data_project2") #this line I would need to run in the consoleknitr::opts_knit$set(root.dir ="/Users/lisalevoir/BIOS7351_Collab/github/BIOS_Collaboration/project_2_analysis") #now I set global options for knitting, I also had to toggle global options > R Markdown > evaluate chunks in current directory#import the datadat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/combined_data_203.csv")#to compare that the merging went as expectedVUMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_VUMS.csv")HMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_HMS.csv")UVAdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_UVA.csv")```
1.0.1 Inclusion/Exclusion
Criteria to exclude students who most likely took a scored exam:
Any PhD students (n = 2)
Any 5th year program students (n = 19)
M4 students at Vanderbilt (n = 5)
Students who did not complete either Step survey (n = 2)
Students who specifically stated they took a scored Step 1 (n=1)
Based on our criteria we would exclude record IDs:
[1] "uworld_percent_step1_1" "uworld_percent_step1_2"
[1] "amboss_percent_step1_1" "amboss_percent_step1_2"
[1] "length_step1_1" "length_step1_2"
[1] "practicetest_step1_1" "practicetest_step1_2"
[1] "full_test_practice_step1_1" "full_test_practice_step1_2"
[1] "push_step1_1" "push_step1_2"
[1] "push_practice_test_step1_1" "push_practice_test_step1_2"
[1] "push_nbme_practice_score_step1_1" "push_nbme_practice_score_step1_2"
[1] "push_uw_practice_score_step1_1" "push_uw_practice_score_step1_2"
[1] "final_nbme_practice_score_step1_1" "final_nbme_practice_score_step1_2"
[1] "final_uw_practice_score_step1_1" "final_uw_practice_score_step1_2"
[1] "score_step1_1" "score_step1_2"
[1] "other_resources_step1_1" "other_resources_step1_2"
[1] "other_step1_1" "other_step1_2"
[1] "changes_step1_1" "changes_step1_2"
[1] "Above is a list of columns I have combined for you. Hope it looks right!"
[1] "uworld_percent_step2_1" "uworld_percent_step2_2"
[1] "amboss_percent_step2_1" "amboss_percent_step2_2"
[1] "length_step2_1" "length_step2_2"
[1] "practicetest_step2_1" "practicetest_step2_2"
[1] "full_test_practice_step2_1" "full_test_practice_step2_2"
[1] "practice_score_step2_1" "practice_score_step2_2"
[1] "practice_test_step2_1" "practice_test_step2_2"
[1] "score_step2_1" "score_step2_2"
[1] "target_score_step2_1" "target_score_step2_2"
[1] "Above is a list of columns I have combined for you. Hope it looks right!"
There were 203 participants after the the clients did data exclusion by hand, as compared to the total survey size which was 182 unique individuals included in the step 1 data.
These are the record IDs that were excluded by the collaborator: VUSM 23, VUSM 60, HMS 37, HMS 41, HMS 49, UVASOM 33, UVASOM 80, UVASOM 84.
For our analysis, I originally included everyone from the 3 school specific data sets. Then, we flagged individuals to remove and I remade the source data set. These are the records IDs that we decided to exclude:
VU record ids: 23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64
HMS record ids: 1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61
UVA record ids: 33, 47, 80, 81, 83
Below are tables for Step 1 and Step 2 response inclusion by school:
Code
kable(table(step1_complete$schoolid), caption ="Number of Step 1 responses included by school")
Number of Step 1 responses included by school
Var1
Freq
HMS
50
UVa
79
VU
53
Code
kable(table(step2_complete$schoolid), caption ="Number of Step 2 responses included by school")
Number of Step 2 responses included by school
Var1
Freq
HMS
39
UVa
65
VU
34
Notice that unfortunately for Step 2 analysis, we had to drop 44 individuals who did not report a step 2 score. These raw counts and frequencies of people who did not give a step 2 score (and are therefor not eligible for analysis) are listed by institution below.
Code
# describing the missingnesskable(table(drop_these_with_no_outcome$schoolid), caption ="Number of missing Step 2 scores by institution")
Number of missing Step 2 scores by institution
Var1
Freq
HMS
11
UVa
14
VU
19
Code
num <-as.vector(table(drop_these_with_no_outcome$schoolid))denom <-as.vector(table(step2_complete$schoolid))nonresponse_freq <-setNames(c(round(num/denom, 3)), c(names(table(step2_complete$schoolid))))kable(nonresponse_freq, caption ="frequency of missing step 2 scores by instition")
frequency of missing step 2 scores by instition
x
HMS
0.282
UVa
0.215
VU
0.559
We also find it helpful to visually profile the missingness in this graphic below. Some of this missingness is due to questions being hierarchical.
Note, we plan to use the robust/sandwich variance estimator for regression models. One inclusion criteria is that the outcome variable (“Y”) must be available for a subject to be included in the analysis question (ie. if they did not report a step 2 score, we won’t perform relevant step 2 analysis on them).
::: panel-tabset ## Q1 Does order have an impact on Step 2 scores?
We cannot analyze Step 1 since all survey responses reported passing For Step 2 scores, I will perform a linear regression with:
summary(exam_order_mod) #model summary for reporting
Call:
lm(formula = score_step2 ~ which_exam_first + school_factor,
data = step2_complete)
Residuals:
Min 1Q Median 3Q Max
-112.100 -5.178 2.767 8.761 21.900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 261.3383 5.2033 50.225 <2e-16 ***
which_exam_first 0.8952 4.5256 0.198 0.844
school_factorUVA -4.0290 5.0342 -0.800 0.425
school_factorVUMS -0.9797 3.5128 -0.279 0.781
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.69 on 134 degrees of freedom
Multiple R-squared: 0.01018, Adjusted R-squared: -0.01198
F-statistic: 0.4593 on 3 and 134 DF, p-value: 0.7112
Based on the model output (p values), there doesn’t appear to be any signifigant associations between exam order and the score for Step 2. Since the R^2 value is essentially 0, I conclude that there was no effect of exam order on step 2 scores.
2.1 Q2 What factors affect Step 2 score?
Again, we cannot analyze Step 1 scores since all respondents reported passing.
Based on our SAP, if there are any covariates with more than 30% of responses missing, we will drop that variable or populate it with 0, depending on context. For example, the percent of Amboss questions completed will be filled with 0 for people who didn’t answer that they used Amboss since it seems safe to assume they didn’t complete any of the Amboss questions. For people who did select that they used Amboss but didn’t answer a percentage will be treated as missing
If less than 30% are missing, I may consider performing bootstrap sampling of known values to replace missing values.
After accounting for missingness, I will assess for co-linearity of the predictors (ie. correlation) using VIF. If there is high co-linearity, we will use LASSO to perform variable selection. If there is no evidence of concerning levels of colinearity, I will proceed with linear regression.
Code
#here I am creating the cleaned Uworld and Amboss variables. if they did not check that they used it, the percentage is set to 0.# if they did use it then the next ifelse checks if there is an NA where there should be a percent# if there is an NA where there should be a percent, we maintain this NA. If there is not, then we can report the percentage.# same process for uworldstep2_complete <- step2_complete %>%mutate(amboss_clean =ifelse(I.amboss ==0, 0, ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, NA, step2_complete$amboss_percent_step2)),uworld_clean =ifelse(I.uworld ==0, 0, ifelse(is.na(step2_complete$uworld_percent_step2 ) ==TRUE, NA, step2_complete$uworld_percent_step2)))#inspecting percent missing, it seems like most responses are now complete enough for analysisstep2_complete$amboss_percent_step2 <-ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, 0, step2_complete$amboss_percent_step2)class(step2_complete$practicetest_step2) <-"integer"step2_complete[,"on_target"] <-factor(step2_complete$target_score_step2, levels =c(1,2,3), labels =c("at target", "above target", "below target"))# I looked into the difference between practice_test_step2 (the final practice test I took before my exam was... 8 options) and practicetest_step2 (text response of how many practice tests did you take before step 2).# I decided to use practicetest_step2 (text response of how many practice tests did you take before step 2) but extract and recode it as numeric belowstep2_complete[, "practice_test_2_clean"] <-ifelse(is.na(step2_complete$practicetest_step2) ==TRUE, NA, substr(step2_complete$practicetest_step2, start =1, stop =1))step2_complete[, "number_of_practice_tests"] <-as.numeric(step2_complete$practice_test_2_clean) #I want to include the number of practice tests as just one numeric covariate, not coded as a factor like it is in practice_test_2_cleanstep2_complete$full_practice_test <-factor(step2_complete$full_test_practice_step2 , levels =c(1:4), labels =c("Yes and I'm glad I did", "Yes and it was unnecessary", "No and I wish I did", "No and I'm glad I did not"))step2_complete$simulate_full_practice <-ifelse(step2_complete$full_test_practice_step2 <=2, 1, 0)step2_complete$simulate_full_practice <-factor(step2_complete$simulate_full_practice, levels =c(0, 1), labels =c("No", "Yes"))step2_complete$length_step2 <-factor(step2_complete$length_step2 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))######## profile missingness in the step 2 data and addressstep2_profile <- step2_complete %>%relocate(c(score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor), .before = schoolid)percents_missing <-round(colSums(is.na(step2_profile))/nrow(step2_profile), 3)*100kable(percents_missing, caption ="Percent missing observations for pooled Step 2 survey")
Percent missing observations for pooled Step 2 survey
x
record_id
0.0
score_step2
0.0
uworld_clean
1.4
amboss_clean
9.4
length_step2
1.4
simulate_full_practice
0.0
practice_score_step2
29.7
number_of_practice_tests
5.1
school_factor
0.0
schoolid
0.0
exam_order
0.0
uworld_percent_step2
1.4
amboss_percent_step2
0.0
practicetest_step2
5.1
full_test_practice_step2
0.0
practice_test_step2
29.0
target_score_step2
0.0
I.uworld
0.0
I.firstaid
0.0
I.anki
0.0
I.sketchy
0.0
I.amboss
0.0
I.pathoma
0.0
I.boardsbeyond
0.0
I.other
0.0
order_factor
0.0
which_exam_first
0.0
on_target
0.0
practice_test_2_clean
5.1
full_practice_test
0.0
Multiple linear regression with:
Y = Step 2 score (from “What was your 3-digit score on the official Step 2 exam?”)
X1 = % UWorld (recoded from “Approximately what % of UWorld did you complete during your protected study period for step 2?”)
X2 = % Amboss (recoded from “Approximately what % of the Amboss question bank did you complete during your Step 2 protected study period?”)
X3 = length study (factor from “How long did you study for Step 2 during a protected study period?”)
X4 = number of practice tests (recoded from a text answer from “How many total practice tests did you take before Step 2?”)
X5 = full test day (yes/no code as binary from “Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?”)
X6 = final practice score (What was your 3 digit score on the final practice test you took before the Step 2 exam?)
Z = School (need to adjust for this in order to remove any influence)
Careful to note that not all cases are complete - for example there are 399 responses in the complete step 2 dataset, of which for the number of practice tests taken, 108 are missing and 291 have a response recorded.
Below I report the model results, sandwich variance, and VIF for step 2 scores model. After that, I show descriptive statistics for predictors and the outcome that was included in the model.
mod_step2_scores %>%tbl_regression() #this makes a cleaner looking summary table
Characteristic
Beta
95% CI1
p-value
uworld_clean
0.00
-0.08, 0.09
>0.9
amboss_clean
0.04
-0.04, 0.13
0.3
length_step2
3-4 weeks
—
—
5-6 weeks
1.9
-2.5, 6.4
0.4
7-8 weeks
-4.2
-9.7, 1.3
0.13
more than 8 weeks
-7.8
-21, 5.0
0.2
simulate_full_practice
No
—
—
Yes
0.41
-3.9, 4.7
0.9
practice_score_step2
0.60
0.46, 0.74
<0.001
number_of_practice_tests
-1.2
-2.5, 0.18
0.090
school_factor
HMS
—
—
UVA
0.85
-4.2, 5.9
0.7
VUMS
-1.4
-6.3, 3.5
0.6
1 CI = Confidence Interval
Trying the model with imputation, but there is an issue right now with just imputing one value instead of randomly sampling…
Note
imputation needs to be fixed before interpreting this model! see the snippet of the updated data showing this issue
Code
#following this thought process for the variables that had missingness for the 2nd model# df %>% mutate(var = ifelse(!is.na(var),var, sample(var[!is.na(var)],1)))# as described on this blog https://community.rstudio.com/t/replacing-nas-with-random-values-from-a-set/158833/2 ## model variables were uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor# with missingness in uworld_clean, amboss_clean, length_step2, practice_score_step2, and number_of_practice_testsstep2_impute <- step2_complete %>%select( score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor ) %>%mutate(uworld =ifelse(!is.na(uworld_clean), uworld_clean, sample(uworld_clean[!is.na(uworld_clean)], 1)),amboss =ifelse(!is.na(amboss_clean), amboss_clean, sample(amboss_clean[!is.na(amboss_clean)], 1)),length =factor(ifelse(!is.na(length_step2), length_step2, sample(length_step2[!is.na(length_step2)], 1)), levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks")),prac_score =ifelse(!is.na(practice_score_step2), practice_score_step2, sample(practice_score_step2[!is.na(practice_score_step2)], 1)),num_tests =ifelse(!is.na(number_of_practice_tests), number_of_practice_tests, sample(number_of_practice_tests[!is.na(number_of_practice_tests)], 1)) )colSums(is.na(step2_impute)) #to confirm the random sampling replacement worked
#here is a view of one change:# cbind(step2_impute$length_step2, step2_impute$length)cbind(step2_impute$practice_score_step2, step2_impute$prac_score)[1:22,]
mod_impute_step2_scores %>%tbl_regression() #this makes a cleaner looking summary table
Characteristic
Beta
95% CI1
p-value
uworld
0.05
-0.03, 0.12
0.2
amboss
0.05
-0.02, 0.13
0.2
length
1-2 weeks
—
—
3-4 weeks
15
-4.8, 36
0.13
5-6 weeks
15
-5.5, 35
0.2
7-8 weeks
11
-10, 31
0.3
more than 8 weeks
-13
-37, 10
0.3
simulate_full_practice
No
—
—
Yes
2.1
-1.6, 5.8
0.3
prac_score
0.55
0.44, 0.67
<0.001
num_tests
-0.63
-1.8, 0.53
0.3
school_factor
HMS
—
—
UVA
-0.24
-4.9, 4.4
>0.9
VUMS
-2.7
-7.4, 2.1
0.3
1 CI = Confidence Interval
2.2 Q3 What is associated (in this data) with pushing back a Step 1 exam date?
Here, I will perform logistic regression with
Y = yes or no (1 = yes, 2 = no for “push_step1”)
There may not be sufficient data on this since only 20 people responded that they decided to push back Step 1. The factors that were measured are:
push remember step1 (1 = I only remember the form name, 2 = I only remember the score, 3 = I remember the form name and the score, 4 = I don’t remember either) - we decided not to include this variable (can change later if desired)
push score only step 1 (1 = NBME, 2 = Uworld)
push practice test step 1 ( 1 - 8 listing various exams)
push nbme practice score (from 0 to 100%)
push uw practice score (from 180 to 300)
Listing variables by name and if I have included them:
“push_step1_1” yes
“push_remember_step1_1” not included b/c a precursor question
“push_score_only_step1_1” not currently included but could be
“push_practice_test_step1_1” yes
“push_nbme_practice_score_step1_1” yes
“push_uw_practice_score_step1_1” yes
Code
step1_complete$push_step1 <-ifelse(step1_complete$push_step1 ==2|is.na(step1_complete$push_step1) ==TRUE, 2, 1) #recording the NA's to be "No" (they did not push back step 1)step1_complete$push_step1_label <-factor(step1_complete$push_step1, levels =c(1, 2), labels =c("Yes", "No")) #making a nice descriptive label#did_push_df <- step1_complete %>% filter(push_step1_label == "Yes")#dat %>% filter(!is.na(push_remember_step1_1))step1_complete$push_practice_test_step1 <-factor(step1_complete$push_practice_test_step1, levels =c(1:8), labels =c("NBME 25", "NBME 26", "NBME 27", "NBME 28", "NBME 29", "NBME 30", "UWorld 1", "UWorld 2"))#making labels for descriptive tableunits(step1_complete$push_uw_practice_score_step1) <-"score units"units(step1_complete$push_nbme_practice_score_step1) <-"percent"label(step1_complete$push_practice_test_step1) <-"exam that triggered pushing back"label(step1_complete$push_nbme_practice_score_step1) <-"NBME P(passing) that triggered pushing back"label(step1_complete$push_uw_practice_score_step1) <-"3 digit UWorld score that triggered pushing back"caption <-"Description of indiviuals that pushed back Step 1"table1(~ push_practice_test_step1 + push_uw_practice_score_step1 + push_nbme_practice_score_step1 |push_step1_label, data=step1_complete, topclass="Rtable1-zebra")
Yes (N=20)
No (N=162)
Overall (N=182)
exam that triggered pushing back
NBME 25
0 (0%)
0 (0%)
0 (0%)
NBME 26
0 (0%)
0 (0%)
0 (0%)
NBME 27
1 (5.0%)
0 (0%)
1 (0.5%)
NBME 28
1 (5.0%)
0 (0%)
1 (0.5%)
NBME 29
0 (0%)
0 (0%)
0 (0%)
NBME 30
0 (0%)
0 (0%)
0 (0%)
UWorld 1
0 (0%)
0 (0%)
0 (0%)
UWorld 2
1 (5.0%)
0 (0%)
1 (0.5%)
Missing
17 (85.0%)
162 (100%)
179 (98.4%)
3 digit UWorld score that triggered pushing back (score units)
Mean (SD)
180 (NA)
NA (NA)
180 (NA)
Median [Min, Max]
180 [180, 180]
NA [NA, NA]
180 [180, 180]
Missing
19 (95.0%)
162 (100%)
181 (99.5%)
NBME P(passing) that triggered pushing back (percent)
Mean (SD)
65.6 (18.3)
NA (NA)
65.6 (18.3)
Median [Min, Max]
67.5 [30.0, 87.0]
NA [NA, NA]
67.5 [30.0, 87.0]
Missing
12 (60.0%)
162 (100%)
174 (95.6%)
2.3 Descriptive Statistics (now just needs to look uniform)
Descriptive statistics will be reported in total and not by school (as requested by the collaborators).
Question banks
% UWorld for step 1 and 2
% Amboss for step 1 and 2
length of study for step 1 and step 2 (barplot with frequency table)
number of practice tests (recoded from a text answer from “How many total practice tests did you take before Step 2?”)
full test day as factor from “Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?” “full_practice_test”
histogram of Step 2 scores
frequency table of Step 1 scores
what resources are most widely used barplot
number of practice tests histogram (among the people who answered the question) for step 1
summarize comments for other_resources, other_step, changes_step
Practice Scores for Step 1 and Step 2 - probability of passing for step 1 - final_nbme_practice_score_step1 is a percent from 0 to 100 - final_uw_practice_score_step1 is a 3 digit UWorld score for step 1 - final practice score for step 2 (3 digit score)
looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.
Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.
These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.
Code
# Here is how I am planning to organize it:# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2# 2 length of study for step 1 and step 2 (barplot with frequency table)# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")# 4 number of practice tests histogram (among the people who answered the question) for step 1# 5 full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"# 6 histogram of Step 2 scores# 7 frequency table of Step 1 scores# 8 what resources are most widely used barplot# 9 practice score step 2# 10 summarize comments for other_resources, other_step, changes_step# 11 Practice Scores for Step 1 and Step 2# 11a probability of passing for step 1# 11b final_nbme_practice_score_step1 is a percent from 0 to 100# 11c final_uw_practice_score_step1 is a 3 digit UWorld score for step 1# 11d final practice score for step 2 (3 digit score)#### and these are covered in other plots already (as numbered)# 11b looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.# 7 Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.# 9 These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.############################ now see the plots below as organized by the list I made above# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2qudf <-data.frame(amount =c(step1_complete$uworld_percent_step1, step1_complete$amboss_percent_step1, step2_complete$uworld_percent_step2, step2_complete$amboss_percent_step2), step_exam =c(rep("Step 1", times =nrow(step1_complete)*2), rep("Step 2", times =nrow(step2_complete)*2)), resource_name =c(rep("UWorld", times =nrow(step1_complete)), rep("Amboss", times =nrow(step1_complete)), rep("Uworld", times =nrow(step2_complete)), rep("Amboss", times =nrow(step2_complete))))ggplot(qudf, aes(amount)) +geom_histogram() +facet_wrap(~step_exam + resource_name) +theme_minimal() +scale_fill_brewer(palette="Blues")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 2 length of study for step 1 and step 2 (barplot with frequency table)## how long did the students study barplot# for step 1step1_complete$length_step1 <-factor(step1_complete$length_step1 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))ggplot(data.frame(step1_complete), aes(x=length_step1)) +geom_bar() +theme_minimal() +xlab("Time") +ylab("Frequency") +labs(title ="How long did you study for Step 1 during a protected study period?")
Code
# for step 2ggplot(data.frame(step2_complete), aes(x=length_step2)) +geom_bar() +theme_minimal() +xlab("Time") +ylab("Frequency") +labs(title ="How long did you study for Step 2 during a protected study period?")
Code
# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")ggplot(aes(x = number_of_practice_tests), data= step2_complete) +geom_histogram() +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +labs(title ="How many total practice tests did you take before Step 2?")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 4 number of practice tests histogram (among the people who answered the question) for step 1ggplot(aes(x = practicetest_step1), data= step1_complete) +geom_histogram() +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +labs(title ="How many total practice tests did you take before Step 1?") ## number of practice tests histogram (among the people who answered the question)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 5 full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"ggplot(aes(x = full_practice_test), data= step2_complete) +geom_bar() +theme_minimal() +xlab("Response") +ylab("Frequency") +labs(title ="Did you simulate a full Step 2 test day experience, \n e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Code
# 7 frequency table of Step 1 scoreskable(table(step1_complete$score_step1), caption ="Frequency of passing Step 1") # would be nice to have a way to summarize these in the same table, but this can just be remade with a table in Word
Frequency of passing Step 1
Var1
Freq
1
154
Code
kable(table(is.na(step1_complete$score_step1)), caption ="True indicates missing values for Step 1 score")
True indicates missing values for Step 1 score
Var1
Freq
FALSE
154
TRUE
28
Code
# 8 what resources are most widely used barplot resources_2 <- step2_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))resources_1 <- step1_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))rdf <-data.frame(amount =c(t(resources_1), t(resources_2)), step_exam =c(rep("Step 1", 8), rep("Step 2", 8)), resource_name =rep(names(resources_1), times =2))# http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization used this as a guide#what resources are most widely used? I want to sort this barplot in order of frequency but for some reason this wasn't working for me with reorder()ggplot(data=rdf, aes(x=resource_name, y=amount, fill = step_exam)) +geom_bar(stat="identity", position=position_dodge())+theme_minimal() +coord_flip() +scale_fill_brewer(palette="Blues")
Code
# 9 practice score step 2ggplot(aes(x = practice_score_step2), data= step2_complete) +geom_histogram() +scale_fill_brewer(palette="Blues") +theme_minimal() +xlab("Self Reported Step 2 Practice Score") +ylab("Frequency") +labs(title ="Reported Step 2 Practice Scores")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
NBME and Dr. Legallo's videos and notes from preclinical
1
nbme practice exams
1
NBME practice tests
1
Online MedEd
1
Pixorize
1
Youtube
1
Code
#If you used other resources not listed above, would you use them again? step1_complete$other_hascontents <-as.integer(nchar(step1_complete$other_resources_step1) >0) #flagging the rows that have contents https://stackoverflow.com/questions/64744988/testing-to-see-if-characters-are-present-in-a-cell-in-ras_tibble(step1_complete[which(step1_complete$other_hascontents ==1), c("other_resources_step1", "other_step1") ])
# A tibble: 8 × 2
other_resources_step1 other_step1
<chr> <chr>
1 Pixorize Yes, helped immensel…
2 Dirty Medicine YouTube videos yes
3 Dirty Medicine Yes
4 nbme practice exams yes
5 NBME practice tests Yes
6 Youtube Yes
7 Online MedEd Maybe
8 NBME and Dr. Legallo's videos and notes from preclinical yes, NBME and Dr. Le…
Code
# what would you change? step1_complete$change_hascontents <-as.integer(nchar(step1_complete$changes_step1) >0) as.data.frame(step1_complete[which(step1_complete$change_hascontents ==1), "changes_step1" ])
step1_complete[which(step1_complete$change_hascontents == 1), "changes_step1"]
1 I would ditch pathoma and just do Anki, UWorld, sketchy and pixorize just for Biochem and immunology. I would focus my UWorld on things our preclinical curriculum does not cover well like anatomy, pharmacology, microbiology, etc. I would focus less on clinical info since we were coming off a year of clerkships.
2 I did it in 3 weeks and it was VERY stressful. However, this ended up being more than enough time. I would feel reassured that again it is PASS/FAIL so it will be OK!!
3 Would have started earlier in preclinical - would have ignored my preclinical classes and used more outside resources to build foundation. It was difficult to get up to speed for Step 1, and I regret not studying more in preclinical as I think my Step 2 score would have been significantly higher. I feel that what was not discussed was how much Step 1 content is still relevant to Step 2.
4 I would have studied for 4 weeks instead of 6 weeks. However, I took my test after thanksgiving so there were less test dates available.
5 More SketchyMicro. Start using outside resources during 1st year. More UWorld.
6 I could have studied less (like 3-4 weeks) because I was consistently scoring well enough to pass on my practice tests but I was too nervous to take any chances and decided to instead just use the entire 5 weeks I had already decided to allocate
7 I would make all of my study time part time (like I did when at home for the first 4 weeks). I wouldn't let classmates studying 24/7 (unnecessarily in my opinion) stress me out.
8 I took about 3 weeks to review Sketchy Micro/Pharm and Pathoma as well as do some Uworld blocks and practice tests. After scoring so highly on my practice test, I actually stopped studying except for finishing Sketchy. If I could go back, I would just incorporate more of Sketchy, Pathoma, and Uworld into pre-dedicated studying and then just take Step 1 with perhaps 1-2 weeks max of dedicated.
9 Focus more on the things that overlap with step 2
10 I just used sketchy/anki for micro, and that was great. I don't love it for everything, but I think I might have also tried that combo for some of the biochem.
11 I would have taken the exam early. I think the school's guidance of waiting until you've gotten 99% chance of passing on 2 practice tests was probably a bit excessive but without more information I didn't want to risk not studying as much as possible.
12 would have done more sketchy, moved through pathoma faster (knew most of it already from shelf exams/clinical year)
13 I would keep a more strict study schedule in the beginning of my study period.
14 Taking a longer period of time for dedicated. Not done anything else but study during those months
15 Not do Sketchy path
16 In hindsight, I realize that a more effective approach would have been to dedicate approximately six weeks to my study plan. During this time, I would have worked through the entirety of UWorld, Sketchy Micro and Pharm, as well as Pathoma. Additionally, I would have made it a point to take every available UWorld and NBME practice test, ensuring that I simulated strict testing conditions. I should note that my ability to complete only about 25% of UWorld was somewhat facilitated by my solid foundation from my first year of medical school and the experience gained from my shelf exams during the PCE.
17 Spent more time studying during pre-clinical years
18 Use less time. 4-5 weeks is good enough.
19 I would have taken Step 1 prior to 3rd year clerkships.
20 I wish it was after pre-clinical year and not after clinical year.
21 Since I took step 1 after clerkships, I did not have to study much. I used it as a prep time for basic sciences of step 2. I would not have bought World for step 1 as I would have been fine without it
22 I wish we had completed step 1 post pre clerkship with dedicated time.
23 I would focus on pathophysiology section of uworld
24 It's really not that hard to pass. Just knowing First Aid and going through UWorld is sufficient.
25 I would have started reviewing material (i.e. watching B&B, Sketchy pharm) before dedicated started so that I could devote my dedicated time almost exclusively to UWorld questions
26 Though I probably could have studied much less, I did not find the "% likelihood of passing" metric very reassuring and still ended up studying a lot of make sure it was as reassuring as possible.
27 I would probably study really hard for 2 weeks and take it at the end of the 2 weeks
28 I would probably take the test after 2 weeks instead of 3
29 I would have taken it immediately after ending clerkships and then gone off to enjoy my summer. I think many students would be more than capable of taking the exam following first year before the start of clerkships, and in many ways I think the increased proximity to many of the basic science concepts tested on Step 1 would have been a major advantage. Some of the studying felt like taking a step backwards instead of progressing to try to advance the depth of my clinical knowledge going into the immersion phase.
30 I ended up taking 12 weeks. M1 was terrible for me and I didn't know how to study. So, I would go back and study more for it during M1 or M2 but even 12 weeks was pushing it for me to make up the deficit I was in
31 I took a practice test on Day 1 of studying, which I did not find helpful. I did poorly on Day 1 and my score improved by 50 points over 2.5 weeks, so the Day 1 baseline was useless. I would have studied for slightly longer to feel more confident and take one more practice exam. I took 1 practice exam pre-studying and 2 after studying. I passed 1 exam 8 days prior to the test. My study period was 23 days, with days off during that, and I think near 28 days would have been ideal to feel less rushed and taken an additional exam. I do not have a great memory, so I felt like I needed more time, but everyone's different.
32 Much more uworld, less sketchy and anki
33 None
34 If anything, maybe one less week. It can definitely be done in 4 weeks but the extra time gave me a comfortable buffer
35 I would have a shorter study period and focused more on step 2
36 I would have switched to timed, untutored mode on U world sooner. On the real exam, my confidence was so shaky because I was used to the immediate gratification of tutor mode and being able to take my time to think through the harder questions.
37 I wish I would have started the premade anki deck during M1 instead of at the start of M2. I finished about 50% of STEP 1 cards and 75% of STEP 2 cards by the time I took Step 1 and I think it would have been much quicker to a passing score if i had a higher percentage of STEP 1 cards mastered. I did not use UWorld at all because I found that reviewing missed questions didn't help me at all. I never remembered the info or explanations unless I memorized it with anki first. Once I had things memorized, I had no issues applying the knowledge in exam form. The only questions I missed were ones that I had never seen on anki.
38 Just do UWorld and know First Aid well, that's all you really need.
39 None.
Code
#dat$study_amount_step1_1# 11 Practice Scores for Step 1 and Step 2# 11a final_nbme_practice_score_step1 is a percent from 0 to 100# 11b final_uw_practice_score_step1 is a 3 digit UWorld score for step 1# 11c final practice score for step 2 (3 digit score)hist(step1_complete$final_nbme_practice_score_step1, main ="What was the probability of passing you received \n on your final NBME form before the Step 1 exam?", breaks =40)
Code
kable(table(!is.na(step1_complete$final_nbme_practice_score_step1)), caption ="More responses are missing for P(passing step 1) from NBME") #so 118 are missing and 64 are there
More responses are missing for P(passing step 1) from NBME
Var1
Freq
FALSE
118
TRUE
64
Code
hist(step1_complete$final_uw_practice_score_step1, main ="What was the 3 digit score you received on your final UWorld form before the Step 1 exam?", breaks =40)
Code
kable(table(is.na(step1_complete$final_uw_practice_score_step1)), caption ="Most responses are missing for UWorld Step 1 final score")
Most responses are missing for UWorld Step 1 final score
Var1
Freq
FALSE
16
TRUE
166
Code
hist(step2_complete$practice_score_step2, main ="What was your 3 digit score on the final practice test \n you took before the Step 2 exam?", breaks =40)
Code
kable(table(is.na(step2_complete$practice_score_step2)), caption ="there are 41 missing final step2 practice scores")
there are 41 missing final step2 practice scores
Var1
Freq
FALSE
97
TRUE
41
3 Appendix/notes
All the analyses are performed using the following: - R version 4.2.2 (2022-06-24); R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. - Harrell Jr FE (2022). rms: Regression Modeling Strategies. R package version 6.3-0, https://CRAN.R-project.org/package=rms. The table below lists packages used in this document.
---title: "USMLE Data Analysis"author: "Lisa Levoir"date: "`r format(Sys.time(), '%B %d, %Y')`"format: html: theme: flatly code-fold: true code-tools: true html-math-method: katex toc: true toc-depth: 3 fig-width: 13 fig-height: 10 toc-title: "Contents" number-sections: true self-contained: true self-contained-math: true smooth-scroll: true fontsize: 0.8em title-block-banner: true citation-location: margin include-after-body: graph_fold.htmleditor: visualengine: knitr---# Data import and cleaningIn this section I import the data and prepare it for analysis.```{r setup}#| warning: false#| message: false#| echo: fencedlibrary(tidyverse)library(plotly)library(dplyr)library(tidyr)library(knitr)library(table1) #Create HTML Tables of Descriptive Statistics https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html#library(OMTM1) #https://github.com/schildjs/OMTM1/library(Hmisc)library(rms) # Regression Modeling Strategies by Frank https://cran.r-project.org/web/packages/rms/index.htmllibrary(modelsummary) #Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready https://cran.r-project.org/web/packages/modelsummary/index.htmllibrary(scales) # The scales packages provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes. https://scales.r-lib.orglibrary(viridis) #colorslibrary(cowplot) #allows me to use plotgridlibrary(gridExtra) #adding tables to plotslibrary(visdat) #shows missing datalibrary(GGally) #makes pairs plotslibrary(sandwich) #for robust standard errorslibrary(gtsummary) #makes the tables look nicersetwd("/Users/lisalevoir/BIOS7351_Collab/data_project2") #this line I would need to run in the consoleknitr::opts_knit$set(root.dir ="/Users/lisalevoir/BIOS7351_Collab/github/BIOS_Collaboration/project_2_analysis") #now I set global options for knitting, I also had to toggle global options > R Markdown > evaluate chunks in current directory#import the datadat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/combined_data_203.csv")#to compare that the merging went as expectedVUMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_VUMS.csv")HMSdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_HMS.csv")UVAdat <-read.csv("/Users/lisalevoir/BIOS7351_Collab/data_project2/DATA_UVA.csv")```### Inclusion/ExclusionCriteria to exclude students who most likely took a scored exam:- Any PhD students (n = 2)- Any 5th year program students (n = 19)- M4 students at Vanderbilt (n = 5)- Students who did not complete either Step survey (n = 2)- Students who specifically stated they took a scored Step 1 (n=1)Based on our criteria we would exclude record IDs:- VUSM: 23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64- HMS: 1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61- UVA: 33, 47, 80, 81, 83```{r data_cleaning}#| warning: false#| message: false#| echo: false#creating a unique ID for each participant so I can easily summarize how many people have been included/excluded at each stepdat <- dat %>%mutate(uniqueID =paste(school, record_id))VUMSdat <- VUMSdat %>%mutate(uniqueID =paste("VUSM", record_id))HMSdat <- HMSdat %>%mutate(uniqueID =paste("HMS", record_id))UVAdat <- UVAdat %>%mutate(uniqueID =paste("UVASOM", record_id))## compare the individuals that were excluded by the collaborator (in dat) vs. the full datasetall_survey_ids <-c(VUMSdat$uniqueID, HMSdat$uniqueID, UVAdat$uniqueID)collaborator_survey_ids <- dat %>%select(uniqueID) %>%pull()``````{r}#| warning: false#| message: false#| echo: false#the list of IDs we decided as a group to excludeexcludeVU <-c(23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64)excludeHMS <-c(1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61)excludeUVA <-c(33, 47, 80, 81, 83)'%!in%'<-function(x,y)!('%in%'(x,y)) #make a way to use the not in commandVU_in <-filter(VUMSdat, record_id %!in% excludeVU)H_in <-filter(HMSdat, record_id %!in% excludeHMS)UVA_in <-filter(UVAdat, record_id %!in% excludeUVA)#now I want to select the columns I'd like to include for all of my analysis (so they're in the proper order for a cbind). this will be relatively easy to come back to edit later, if needed. # I am also including a school identifier as a common column before I rbind all 3 school data frames togetherVU_in[,"schoolid"] <-"VU"UVA_in[,"schoolid"] <-"UVa"H_in[,"schoolid"] <-"HMS"############ plan for how I will get the data in a format I want:# - pull relevant columns by "starts with"# - confirm all column names match, then# - rbind together once# - then I can select from this sheet the questions relevant to Step 1 first with ends with "_1", and those who took Step 1 second with a "_2"############ Now pulling the common columns we're interested in as predictors and outcomes##note, for VU I removed ""number_other_courses_step1_1" and starts_with("other_courses_step1_1___1) because these questions were not on the other school surveystook_step1_VU <- VU_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step1"),starts_with("amboss_percent_step1"),starts_with("length_step1"),starts_with("practicetest_step1"),starts_with("full_test_practice_step1"), # split into binary: Yes and I am glad I did, Yes and it was unnecessary, No and I wish I did, No and I am glad I did notstarts_with("push_step1"),starts_with("push_practice_test_step1"),starts_with("push_nbme_practice_score_step1"),starts_with("push_uw_practice_score_step1"),starts_with("final_nbme_practice_score_step1"),starts_with("final_uw_practice_score_step1"),starts_with("score_step1"),starts_with("resources_step1"),starts_with("other_resources_step1"),starts_with("other_step1"),starts_with("changes_step1"),"uniqueID","schoolid", "exam_order" ) took_step1_UVA <- UVA_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step1"),starts_with("amboss_percent_step1"),starts_with("length_step1"),starts_with("practicetest_step1"),starts_with("full_test_practice_step1"),starts_with("push_step1"),starts_with("push_practice_test_step1"),starts_with("push_nbme_practice_score_step1"),starts_with("push_uw_practice_score_step1"),starts_with("final_nbme_practice_score_step1"),starts_with("final_uw_practice_score_step1"),starts_with("score_step1"),starts_with("resources_step1"),starts_with("other_resources_step1"),starts_with("other_step1"),starts_with("changes_step1"),"uniqueID","schoolid", "exam_order" )took_step1_H <- H_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step1"),starts_with("amboss_percent_step1"),starts_with("length_step1"),starts_with("practicetest_step1"),starts_with("full_test_practice_step1"),starts_with("push_step1"),starts_with("push_practice_test_step1"),starts_with("push_nbme_practice_score_step1"),starts_with("push_uw_practice_score_step1"),starts_with("final_nbme_practice_score_step1"),starts_with("final_uw_practice_score_step1"),starts_with("score_step1"),starts_with("resources_step1"),starts_with("other_resources_step1"),starts_with("other_step1"),starts_with("changes_step1"),"uniqueID","schoolid", "exam_order" )## now rbinding the three schools togethertook_step1general <-rbind(took_step1_H, took_step1_UVA, took_step1_VU)## splitting the dataset so I also have reference sheets specific to step 1 first and step 1 secondtook_step1_first <- took_step1general %>%select(ends_with("_1"), "schoolid") #I actually haven't used this data frame in the analysistook_step1_second <- took_step1general %>%select(ends_with("_2"), "schoolid") #I actually haven't used this data frame in the analysis#this function will accommodate for the survey which split up the step 2 responses by whether it was taken first or second. I would like to run analysis (and gather percent missing) across all scores as they are available. I can add an indicator column later to identify people who took it first/second. The idea with this function is the output (called storage_df) should be easy to colbind onto the original data frame. Voila!#also if we decide to change our minds and include more variables it will be quick to run.merge_my_columns <-function(input_cols, source_df){storage_df <-as.data.frame(matrix(nrow =nrow(source_df), ncol =length(input_cols)))names(storage_df) <- input_colsfor(i in1:length(input_cols)) { cols <- source_df %>%select(starts_with(input_cols[i])) %>%names()print(cols) new_cobined_col <-coalesce(source_df[, cols[1]], source_df[, cols[2]]) storage_df[, i] <- new_cobined_col}print("Above is a list of columns I have combined for you. Hope it looks right!")return(storage_df)}######## combine results across exam order for Step 2class(took_step1general$practicetest_step1_1) <-"integer"#had to change class in order to coalesce these. the original question was "How many total practice tests did you take before Step 1?" and the response field was text but it should just be integers. ## This will create a warning in the results that NAs are introduced by coercion but that is alrightclass(took_step1general$practicetest_step1_2) <-"integer"class(took_step1general$push_uw_practice_score_step1_2) <-"integer"class(took_step1general$push_uw_practice_score_step1_1) <-"integer"cols_step1 <-c("uworld_percent_step1", "amboss_percent_step1", "length_step1", "practicetest_step1","full_test_practice_step1", "push_step1", "push_practice_test_step1", "push_nbme_practice_score_step1", "push_uw_practice_score_step1" ,"final_nbme_practice_score_step1", "final_uw_practice_score_step1", "score_step1", "other_resources_step1", "other_step1","changes_step1")#now I need to handle the resources question, which was handled differentlytookstep1_resources <- took_step1general %>%mutate(I.uworld =coalesce(resources_step1_1___1 + resources_step1_2___1), I.firstaid =coalesce(resources_step1_1___2 + resources_step1_2___2),I.anki =coalesce(resources_step1_1___3 + resources_step1_2___3), I.sketchy =coalesce(resources_step1_1___4 + resources_step1_2___4), I.amboss=coalesce(resources_step1_1___5 + resources_step1_2___5), I.pathoma =coalesce(resources_step1_1___6 + resources_step1_2___6),I.boardsbeyond =coalesce(resources_step1_1___7 + resources_step1_2___7),I.other =coalesce(resources_step1_1___8 + resources_step1_2___8)) %>%select(I.uworld, I.firstaid, I.anki, I.sketchy, I.amboss, I.pathoma, I.boardsbeyond, I.other)to_add <-merge_my_columns(input_cols = cols_step1, source_df = took_step1general)#table(to_add$score_step1) ## all step 1 scores have an outcome so there is no one to drop/filter outstep1_complete <-bind_cols(took_step1general, to_add, tookstep1_resources) %>%select(record_id, uniqueID:I.other)``````{r data_cleaning_forstep2}#| warning: false#| message: false#| echo: falsetook_step2_VU <- VU_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step2"),starts_with("amboss_percent_step2"),starts_with("length_step2"),starts_with("practicetest_step2"),starts_with("full_test_practice_step2"),starts_with("practice_score_step2"),starts_with("practice_test_step2"),starts_with("score_step2"),starts_with("target_score_step2"),starts_with("resources_step2"),"uniqueID","schoolid", "exam_order" ) #note, I removed ""number_other_courses_step1_1" and starts_with("other_courses_step1_1___1) because these questions were not on the other school surveystook_step2_UVA <- UVA_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step2"),starts_with("amboss_percent_step2"),starts_with("length_step2"),starts_with("practicetest_step2"),starts_with("full_test_practice_step2"),starts_with("practice_score_step2"),starts_with("practice_test_step2"),starts_with("score_step2"),starts_with("target_score_step2"),starts_with("resources_step2"),"uniqueID","schoolid", "exam_order" )took_step2_H <- H_in %>%select(starts_with("record_id"),starts_with("uworld_percent_step2"),starts_with("amboss_percent_step2"),starts_with("length_step2"),starts_with("practicetest_step2"),starts_with("full_test_practice_step2"),starts_with("practice_score_step2"),starts_with("practice_test_step2"),starts_with("score_step2"),starts_with("target_score_step2"),starts_with("resources_step2"),"uniqueID","schoolid", "exam_order" )took_step2general <-rbind(took_step2_H, took_step2_UVA, took_step2_VU)took_step2_first <- took_step2general %>%select(ends_with("_1"), "schoolid", "uniqueID") #don't plan on using these, but available if neededtook_step2_second <- took_step2general %>%select(ends_with("_2"), "schoolid", "uniqueID")#now I need to handle the resources question, which was handled differentlytookstep2_resources <- took_step2general %>%mutate(I.uworld =coalesce(resources_step2_1___1 + resources_step2_2___1), I.firstaid =coalesce(resources_step2_1___2 + resources_step2_2___2),I.anki =coalesce(resources_step2_1___3 + resources_step2_2___3), I.sketchy =coalesce(resources_step2_1___4 + resources_step2_2___4), I.amboss=coalesce(resources_step2_1___5 + resources_step2_2___5), I.pathoma =coalesce(resources_step2_1___6 + resources_step2_2___6),I.boardsbeyond =coalesce(resources_step2_1___7 + resources_step2_2___7),I.other =coalesce(resources_step2_1___8 + resources_step2_2___8)) %>%select(I.uworld, I.firstaid, I.anki, I.sketchy, I.amboss, I.pathoma, I.boardsbeyond, I.other)########combine results across exam order for Step 2class(took_step2general$practice_score_step2_2) <-"integer"#had to change class in order to coalesce theseclass(took_step2general$score_step2_1) <-"integer"class(took_step2general$score_step2_2) <-"integer"columns_to_summarize <-c("uworld_percent_step2", "amboss_percent_step2", "length_step2", "practicetest_step2","full_test_practice_step2", "practice_score_step2", "practice_test_step2", "score_step2","target_score_step2")to_add <-merge_my_columns(input_cols = columns_to_summarize, source_df = took_step2general)#listing the people with no outcome reporteddrop_these_with_no_outcome <-bind_cols(took_step2general, to_add) %>%filter(is.na(score_step2)) %>%select("record_id", "schoolid", "uniqueID")#dropping the people with no outcomeintermediate_step2 <-bind_cols(took_step2general, to_add, tookstep2_resources) %>%filter(!is.na(score_step2))#creating the complete data setstep2_complete <- intermediate_step2 %>%select(record_id, schoolid:I.other)```There were `r length(intersect(all_survey_ids, collaborator_survey_ids))` participants after the the clients did data exclusion by hand, as compared to the total survey size which was `r length(step1_complete$uniqueID)` unique individuals included in the step 1 data.These are the record IDs that were excluded by the collaborator: `r setdiff(all_survey_ids, collaborator_survey_ids)`.For our analysis, I originally included everyone from the 3 school specific data sets. Then, we flagged individuals to remove and I remade the source data set. These are the records IDs that we decided to exclude:- VU record ids: 23, 26, 39, 40, 54, 3, 8, 12, 49, 60, 62, 64- HMS record ids: 1, 21, 28, 30, 34, 37, 39, 41, 44, 47, 49, 61- UVA record ids: 33, 47, 80, 81, 83Below are tables for Step 1 and Step 2 response inclusion by school:```{r}kable(table(step1_complete$schoolid), caption ="Number of Step 1 responses included by school")kable(table(step2_complete$schoolid), caption ="Number of Step 2 responses included by school")```Notice that unfortunately for Step 2 analysis, we had to drop `r nrow(drop_these_with_no_outcome)` individuals who did not report a step 2 score. These raw counts and frequencies of people who did not give a step 2 score (and are therefor not eligible for analysis) are listed by institution below.```{r}# describing the missingnesskable(table(drop_these_with_no_outcome$schoolid), caption ="Number of missing Step 2 scores by institution")num <-as.vector(table(drop_these_with_no_outcome$schoolid))denom <-as.vector(table(step2_complete$schoolid))nonresponse_freq <-setNames(c(round(num/denom, 3)), c(names(table(step2_complete$schoolid))))kable(nonresponse_freq, caption ="frequency of missing step 2 scores by instition")```We also find it helpful to visually profile the missingness in this graphic below. Some of this missingness is due to questions being hierarchical.```{r}######### visually profile missing responsesp1 <- visdat::vis_miss(step1_complete)p2 <- visdat::vis_miss(step2_complete)title_gg <-ggdraw() +draw_label("Response missingness for pooled survey results across exam order")gridded <-plot_grid(p1, p2, label_size =12, ncol =2, align ="hv", label_x =0.5,labels =c("Step 1","Step 2"))plot_grid(title_gg, gridded, nrow =2, rel_heights =c(0.1, 0.9))```# AnalysisNote, we plan to use the robust/sandwich variance estimator for regression models. One inclusion criteria is that the outcome variable ("Y") must be available for a subject to be included in the analysis question (ie. if they did not report a step 2 score, we won't perform relevant step 2 analysis on them).::: panel-tabset## Q1 Does order have an impact on Step 2 scores?We cannot analyze Step 1 since all survey responses reported passing For Step 2 scores, I will perform a linear regression with:- Y = Step 2 score- X = factor ( 1 = "step 1 first", 2 = "step 2 first", 3 = "only step 1", 4 = "only step 2")- Z = school (need to adjust for this)```{r exam_order_analysis}step2_complete[ ,"order_factor"] <-factor(step2_complete$exam_order, levels =c(1,2, 3, 4), labels =c("step 1 first", "step 2 first", "only step 1", "only step 2"))step2_complete[ ,"which_exam_first"] <-ifelse(step2_complete$order_factor =="step 1 first"| step2_complete$order_factor =="only step 1", 1, 2)step2_complete[ ,"school_factor"] <-factor(step2_complete$schoolid, labels =c("HMS", "UVA", "VUMS"))table(step2_complete$order_factor)table(step2_complete$which_exam_first)exam_order_mod <-lm(formula = score_step2 ~ which_exam_first + school_factor, data = step2_complete)sandwich(exam_order_mod) #this gives the sandwich variancesummary(exam_order_mod) #model summary for reporting```Based on the model output (p values), there doesn't appear to be any signifigant associations between exam order and the score for Step 2. Since the $R^2$ value is essentially 0, I conclude that there was no effect of exam order on step 2 scores.## Q2 What factors affect Step 2 score?Again, we cannot analyze Step 1 scores since all respondents reported passing.Based on our SAP, if there are any covariates with more than 30% of responses missing, we will drop that variable or populate it with 0, depending on context. For example, the percent of Amboss questions completed will be filled with 0 for people who didn't answer that they used Amboss since it seems safe to assume they didn't complete any of the Amboss questions. For people who did select that they used Amboss but didn't answer a percentage will be treated as missingIf less than 30% are missing, I may consider performing bootstrap sampling of known values to replace missing values.After accounting for missingness, I will assess for co-linearity of the predictors (ie. correlation) using VIF. If there is high co-linearity, we will use LASSO to perform variable selection. If there is no evidence of concerning levels of colinearity, I will proceed with linear regression.```{r data_profile_and_pairs_plots}#| warning: false#here I am creating the cleaned Uworld and Amboss variables. if they did not check that they used it, the percentage is set to 0.# if they did use it then the next ifelse checks if there is an NA where there should be a percent# if there is an NA where there should be a percent, we maintain this NA. If there is not, then we can report the percentage.# same process for uworldstep2_complete <- step2_complete %>%mutate(amboss_clean =ifelse(I.amboss ==0, 0, ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, NA, step2_complete$amboss_percent_step2)),uworld_clean =ifelse(I.uworld ==0, 0, ifelse(is.na(step2_complete$uworld_percent_step2 ) ==TRUE, NA, step2_complete$uworld_percent_step2)))#inspecting percent missing, it seems like most responses are now complete enough for analysisstep2_complete$amboss_percent_step2 <-ifelse(is.na(step2_complete$amboss_percent_step2) ==TRUE, 0, step2_complete$amboss_percent_step2)class(step2_complete$practicetest_step2) <-"integer"step2_complete[,"on_target"] <-factor(step2_complete$target_score_step2, levels =c(1,2,3), labels =c("at target", "above target", "below target"))# I looked into the difference between practice_test_step2 (the final practice test I took before my exam was... 8 options) and practicetest_step2 (text response of how many practice tests did you take before step 2).# I decided to use practicetest_step2 (text response of how many practice tests did you take before step 2) but extract and recode it as numeric belowstep2_complete[, "practice_test_2_clean"] <-ifelse(is.na(step2_complete$practicetest_step2) ==TRUE, NA, substr(step2_complete$practicetest_step2, start =1, stop =1))step2_complete[, "number_of_practice_tests"] <-as.numeric(step2_complete$practice_test_2_clean) #I want to include the number of practice tests as just one numeric covariate, not coded as a factor like it is in practice_test_2_cleanstep2_complete$full_practice_test <-factor(step2_complete$full_test_practice_step2 , levels =c(1:4), labels =c("Yes and I'm glad I did", "Yes and it was unnecessary", "No and I wish I did", "No and I'm glad I did not"))step2_complete$simulate_full_practice <-ifelse(step2_complete$full_test_practice_step2 <=2, 1, 0)step2_complete$simulate_full_practice <-factor(step2_complete$simulate_full_practice, levels =c(0, 1), labels =c("No", "Yes"))step2_complete$length_step2 <-factor(step2_complete$length_step2 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))######## profile missingness in the step 2 data and addressstep2_profile <- step2_complete %>%relocate(c(score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor), .before = schoolid)percents_missing <-round(colSums(is.na(step2_profile))/nrow(step2_profile), 3)*100kable(percents_missing, caption ="Percent missing observations for pooled Step 2 survey")```Multiple linear regression with:- Y = Step 2 score (from "What was your 3-digit score on the official Step 2 exam?")- X1 = % UWorld (recoded from "Approximately what % of UWorld did you complete during your protected study period for step 2?")- X2 = % Amboss (recoded from "Approximately what % of the Amboss question bank did you complete during your Step 2 protected study period?")- X3 = length study (factor from "How long did you study for Step 2 during a protected study period?")- X4 = number of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")- X5 = full test day (yes/no code as binary from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?")- X6 = final practice score (What was your 3 digit score on the final practice test you took before the Step 2 exam?)- Z = School (need to adjust for this in order to remove any influence)Careful to note that not all cases are complete - for example there are 399 responses in the complete step 2 dataset, of which for the number of practice tests taken, 108 are missing and 291 have a response recorded.Below I report the model results, sandwich variance, and VIF for step 2 scores model. After that, I show descriptive statistics for predictors and the outcome that was included in the model.```{r}mod_step2_scores <-lm(score_step2 ~ uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor, data = step2_complete) summary(mod_step2_scores)sandwich(mod_step2_scores)vif(mod_step2_scores)mod_step2_scores %>%tbl_regression() #this makes a cleaner looking summary table```Trying the model with imputation, but there is an issue right now with just imputing one value instead of randomly sampling...::: callout-noteimputation needs to be fixed before interpreting this model! see the snippet of the updated data showing this issue:::```{r}#following this thought process for the variables that had missingness for the 2nd model# df %>% mutate(var = ifelse(!is.na(var),var, sample(var[!is.na(var)],1)))# as described on this blog https://community.rstudio.com/t/replacing-nas-with-random-values-from-a-set/158833/2 ## model variables were uworld_clean + amboss_clean + length_step2 + simulate_full_practice + practice_score_step2 + number_of_practice_tests + school_factor# with missingness in uworld_clean, amboss_clean, length_step2, practice_score_step2, and number_of_practice_testsstep2_impute <- step2_complete %>%select( score_step2, uworld_clean, amboss_clean, length_step2, simulate_full_practice, practice_score_step2, number_of_practice_tests, school_factor ) %>%mutate(uworld =ifelse(!is.na(uworld_clean), uworld_clean, sample(uworld_clean[!is.na(uworld_clean)], 1)),amboss =ifelse(!is.na(amboss_clean), amboss_clean, sample(amboss_clean[!is.na(amboss_clean)], 1)),length =factor(ifelse(!is.na(length_step2), length_step2, sample(length_step2[!is.na(length_step2)], 1)), levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks")),prac_score =ifelse(!is.na(practice_score_step2), practice_score_step2, sample(practice_score_step2[!is.na(practice_score_step2)], 1)),num_tests =ifelse(!is.na(number_of_practice_tests), number_of_practice_tests, sample(number_of_practice_tests[!is.na(number_of_practice_tests)], 1)) )colSums(is.na(step2_impute)) #to confirm the random sampling replacement worked#here is a view of one change:# cbind(step2_impute$length_step2, step2_impute$length)cbind(step2_impute$practice_score_step2, step2_impute$prac_score)[1:22,]mod_impute_step2_scores <-lm(score_step2 ~ uworld + amboss + length + simulate_full_practice + prac_score + num_tests + school_factor, data = step2_impute) summary(mod_impute_step2_scores)sandwich(mod_impute_step2_scores)vif(mod_impute_step2_scores)mod_impute_step2_scores %>%tbl_regression() #this makes a cleaner looking summary table```## Q3 What is associated (in this data) with pushing back a Step 1 exam date?Here, I will perform logistic regression withY = yes or no (1 = yes, 2 = no for "push_step1")There may not be sufficient data on this since only 20 people responded that they decided to push back Step 1. The factors that were measured are:- push remember step1 (1 = I only remember the form name, 2 = I only remember the score, 3 = I remember the form name and the score, 4 = I don't remember either) - we decided not to include this variable (can change later if desired)- push score only step 1 (1 = NBME, 2 = Uworld)- push practice test step 1 ( 1 - 8 listing various exams)- push nbme practice score (from 0 to 100%)- push uw practice score (from 180 to 300)Listing variables by name and if I have included them:- "push_step1_1" yes- "push_remember_step1_1" not included b/c a precursor question- "push_score_only_step1_1" not currently included but could be- "push_practice_test_step1_1" yes- "push_nbme_practice_score_step1_1" yes- "push_uw_practice_score_step1_1" yes```{r}step1_complete$push_step1 <-ifelse(step1_complete$push_step1 ==2|is.na(step1_complete$push_step1) ==TRUE, 2, 1) #recording the NA's to be "No" (they did not push back step 1)step1_complete$push_step1_label <-factor(step1_complete$push_step1, levels =c(1, 2), labels =c("Yes", "No")) #making a nice descriptive label#did_push_df <- step1_complete %>% filter(push_step1_label == "Yes")#dat %>% filter(!is.na(push_remember_step1_1))step1_complete$push_practice_test_step1 <-factor(step1_complete$push_practice_test_step1, levels =c(1:8), labels =c("NBME 25", "NBME 26", "NBME 27", "NBME 28", "NBME 29", "NBME 30", "UWorld 1", "UWorld 2"))#making labels for descriptive tableunits(step1_complete$push_uw_practice_score_step1) <-"score units"units(step1_complete$push_nbme_practice_score_step1) <-"percent"label(step1_complete$push_practice_test_step1) <-"exam that triggered pushing back"label(step1_complete$push_nbme_practice_score_step1) <-"NBME P(passing) that triggered pushing back"label(step1_complete$push_uw_practice_score_step1) <-"3 digit UWorld score that triggered pushing back"caption <-"Description of indiviuals that pushed back Step 1"table1(~ push_practice_test_step1 + push_uw_practice_score_step1 + push_nbme_practice_score_step1 |push_step1_label, data=step1_complete, topclass="Rtable1-zebra")```## Descriptive Statistics (now just needs to look uniform)Descriptive statistics will be reported in total and not by school (as requested by the collaborators).- Question banks - % UWorld for step 1 and 2 - % Amboss for step 1 and 2- length of study for step 1 and step 2 (barplot with frequency table)- number of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")- full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"- histogram of Step 2 scores- frequency table of Step 1 scores- what resources are most widely used barplot- number of practice tests histogram (among the people who answered the question) for step 1- summarize comments for other_resources, other_step, changes_stepPractice Scores for Step 1 and Step 2- probability of passing for step 1 - final_nbme_practice_score_step1 is a percent from 0 to 100 - final_uw_practice_score_step1 is a 3 digit UWorld score for step 1- final practice score for step 2 (3 digit score)looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.```{r}# Here is how I am planning to organize it:# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2# 2 length of study for step 1 and step 2 (barplot with frequency table)# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")# 4 number of practice tests histogram (among the people who answered the question) for step 1# 5 full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"# 6 histogram of Step 2 scores# 7 frequency table of Step 1 scores# 8 what resources are most widely used barplot# 9 practice score step 2# 10 summarize comments for other_resources, other_step, changes_step# 11 Practice Scores for Step 1 and Step 2# 11a probability of passing for step 1# 11b final_nbme_practice_score_step1 is a percent from 0 to 100# 11c final_uw_practice_score_step1 is a 3 digit UWorld score for step 1# 11d final practice score for step 2 (3 digit score)#### and these are covered in other plots already (as numbered)# 11b looking at the data dictionary, we are given final_nbme_practice _score_step1 and final_uw_practice_s core_step1 which are two different covariates which are ONLY in Step 1.# 7 Since everyone passed step 1, I did not run a regression model that included these covariates. Instead, I will make a plot to show this in descriptive data.# 9 These specific questions were not included for step 2, instead there is a question for practice_score_step2, which I do include in the model for step 2 score predictors above.############################ now see the plots below as organized by the list I made above# 1 Question banks# - % UWorld for step 1 and 2# - % Amboss for step 1 and 2qudf <-data.frame(amount =c(step1_complete$uworld_percent_step1, step1_complete$amboss_percent_step1, step2_complete$uworld_percent_step2, step2_complete$amboss_percent_step2), step_exam =c(rep("Step 1", times =nrow(step1_complete)*2), rep("Step 2", times =nrow(step2_complete)*2)), resource_name =c(rep("UWorld", times =nrow(step1_complete)), rep("Amboss", times =nrow(step1_complete)), rep("Uworld", times =nrow(step2_complete)), rep("Amboss", times =nrow(step2_complete))))ggplot(qudf, aes(amount)) +geom_histogram() +facet_wrap(~step_exam + resource_name) +theme_minimal() +scale_fill_brewer(palette="Blues")# 2 length of study for step 1 and step 2 (barplot with frequency table)## how long did the students study barplot# for step 1step1_complete$length_step1 <-factor(step1_complete$length_step1 , levels =c(1:6), labels =c("less than 1 week", "1-2 weeks", "3-4 weeks", "5-6 weeks", "7-8 weeks", "more than 8 weeks"))ggplot(data.frame(step1_complete), aes(x=length_step1)) +geom_bar() +theme_minimal() +xlab("Time") +ylab("Frequency") +labs(title ="How long did you study for Step 1 during a protected study period?")# for step 2ggplot(data.frame(step2_complete), aes(x=length_step2)) +geom_bar() +theme_minimal() +xlab("Time") +ylab("Frequency") +labs(title ="How long did you study for Step 2 during a protected study period?")# 3 # of practice tests (recoded from a text answer from "How many total practice tests did you take before Step 2?")ggplot(aes(x = number_of_practice_tests), data= step2_complete) +geom_histogram() +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +labs(title ="How many total practice tests did you take before Step 2?")# 4 number of practice tests histogram (among the people who answered the question) for step 1ggplot(aes(x = practicetest_step1), data= step1_complete) +geom_histogram() +theme_minimal() +xlab("Number of tests") +ylab("Frequency") +labs(title ="How many total practice tests did you take before Step 1?") ## number of practice tests histogram (among the people who answered the question)# 5 full test day as factor from "Did you simulate a full Step 2 test day experience, e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?" "full_practice_test"ggplot(aes(x = full_practice_test), data= step2_complete) +geom_bar() +theme_minimal() +xlab("Response") +ylab("Frequency") +labs(title ="Did you simulate a full Step 2 test day experience, \n e.g. 8 blocks of 40 UWorld questions or 2 of the 4-section practice tests?")# 6 histogram of Step 2 scoresggplot(aes(x = score_step2), data= step2_complete) +geom_histogram() +theme_minimal() +xlab("Self Reported Step 2 Score") +ylab("Frequency") +labs(title ="Frequency of reported Step 2 Scores")# 7 frequency table of Step 1 scoreskable(table(step1_complete$score_step1), caption ="Frequency of passing Step 1") # would be nice to have a way to summarize these in the same table, but this can just be remade with a table in Wordkable(table(is.na(step1_complete$score_step1)), caption ="True indicates missing values for Step 1 score")# 8 what resources are most widely used barplot resources_2 <- step2_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))resources_1 <- step1_complete %>%select(I.uworld:I.other) %>%summarise(across(everything(), ~sum(.)))rdf <-data.frame(amount =c(t(resources_1), t(resources_2)), step_exam =c(rep("Step 1", 8), rep("Step 2", 8)), resource_name =rep(names(resources_1), times =2))# http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization used this as a guide#what resources are most widely used? I want to sort this barplot in order of frequency but for some reason this wasn't working for me with reorder()ggplot(data=rdf, aes(x=resource_name, y=amount, fill = step_exam)) +geom_bar(stat="identity", position=position_dodge())+theme_minimal() +coord_flip() +scale_fill_brewer(palette="Blues")# 9 practice score step 2ggplot(aes(x = practice_score_step2), data= step2_complete) +geom_histogram() +scale_fill_brewer(palette="Blues") +theme_minimal() +xlab("Self Reported Step 2 Practice Score") +ylab("Frequency") +labs(title ="Reported Step 2 Practice Scores")# 10 summarize comments for other_resources, other_step, changes_step## summarize comments (other_resources, other_step, changes_step)#"other_resources_step1" "other_step1" "changes_step1" kable(table(step1_complete$other_resources_step1), caption ="Other listed resources for Step 1")#If you used other resources not listed above, would you use them again? step1_complete$other_hascontents <-as.integer(nchar(step1_complete$other_resources_step1) >0) #flagging the rows that have contents https://stackoverflow.com/questions/64744988/testing-to-see-if-characters-are-present-in-a-cell-in-ras_tibble(step1_complete[which(step1_complete$other_hascontents ==1), c("other_resources_step1", "other_step1") ])# what would you change? step1_complete$change_hascontents <-as.integer(nchar(step1_complete$changes_step1) >0) as.data.frame(step1_complete[which(step1_complete$change_hascontents ==1), "changes_step1" ])#dat$study_amount_step1_1# 11 Practice Scores for Step 1 and Step 2# 11a final_nbme_practice_score_step1 is a percent from 0 to 100# 11b final_uw_practice_score_step1 is a 3 digit UWorld score for step 1# 11c final practice score for step 2 (3 digit score)hist(step1_complete$final_nbme_practice_score_step1, main ="What was the probability of passing you received \n on your final NBME form before the Step 1 exam?", breaks =40)kable(table(!is.na(step1_complete$final_nbme_practice_score_step1)), caption ="More responses are missing for P(passing step 1) from NBME") #so 118 are missing and 64 are therehist(step1_complete$final_uw_practice_score_step1, main ="What was the 3 digit score you received on your final UWorld form before the Step 1 exam?", breaks =40)kable(table(is.na(step1_complete$final_uw_practice_score_step1)), caption ="Most responses are missing for UWorld Step 1 final score")hist(step2_complete$practice_score_step2, main ="What was your 3 digit score on the final practice test \n you took before the Step 2 exam?", breaks =40)kable(table(is.na(step2_complete$practice_score_step2)), caption ="there are 41 missing final step2 practice scores")```# Appendix/notesAll the analyses are performed using the following:- R version 4.2.2 (2022-06-24); R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.- Harrell Jr FE (2022). rms: Regression Modeling Strategies. R package version 6.3-0, <https://CRAN.R-project.org/package=rms>.The table below lists packages used in this document.```{r}subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion))```